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In the following paper we consider a simulation technique for stochastic trees. One 
of the most important areas in computational genetics is the calculation and subsequent 
maximization of the likelihood function associated to such models. This typically consists 
of using importance sampling (IS) and sequential Monte Carlo (SMC) techniques. The 
approach proceeds by simulating the tree, backward in time from observed data, to a most 
recent common ancestor (MRCA). However, in many cases, the computational time and 
variance of estimators are often too high to make standard approaches useful. In this paper 
we propose to stop the simulation, subsequently yielding biased estimates of the likelihood 
surface. The bias is investigated from a theoretical point of view. Results from simulation 
studies are also given to investigate the balance between loss of accuracy, saving in computing 
time and variance reduction. 
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1 Introduction 

There is currently much interest in performing ancestral inference from molecular population 
genetic data. To facilitate this inference, there has been an explosion of research in develop- 
ing computationally efficient methods. These techniques are designed either to compute the 
likelihood, for maximum likelihood estimation, of a sample of genes or for deriving the poste- 
rior distribution on parameters in coalescent models, which describe the ancestry of the genes. 
Broadly speaking there are three main approaches to inference in molecular population genet- 
ics: (i) importance sampling for likelihood evaluation, whose application in population genetics 
was pioneered by (Griffiths & Tavare, 1994a, b,c) (ii) Markov chain Monte Carlo methods (e.g. 
Kuhner et al. (1995), Wilson & Balding (1998)) (iii) Approximate Bayesian Computation (ABC) 
(Del Moral et al. (2009), Marjoram et al. (2003)). See Stephens (2004) for a review. 

In this paper we concentrate on likelihood-based methods. Molecular data have a sampling 
distribution which is a mixture over possible aiic;cstrics. The state space of the anccstricis is huge 
and closed-form expressions are available only in the simplest cases. The objective is to calculate 
a parameter 6i e 6 C R<^« {de G Z+) such that 



for some observed genetic data y € E, parameter 9 Q Q, probability density wg on F and 
l'' : ExFxQ ^ R+ an integrable function. Note that y is typically the genetic types of a random 
sample of chromosomes. In addition, z = {zo,---,Zk) denotes the coalescent history, i.e. the 
set of ancestral configurations at the embedded events in a Markov process where coalescence, 
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mutations or other events take place. Zk denotes the current state, while zq is the state when a 
singleton ancestor is reached. 

Statistical inference associated to l{y; 9) can be regarded as a missing data problem and 
could, in principle, be tackled by the EM algorithm (Dempster et al. 1977) and its Monte Carlo 
extensions (e.g. Fort & Moulines (2003)). However, z, the stochastic tree, can be computationally 
expensive to simulate and such techniques are typically avoided. For example, for the coalescent 
(Kingman, 1982) and ancestral recombination graphs (e.g. Fearnhead & Donelly (2001)), the 
standard approach is to use IS (De forio & Griffiths, 2004a; Griffiths & Tavare, 1994a; Stephens 
& Donelly, 2000) and SMC methods (Chen et al. 2005) to approximate ([I]). These approximations 
are usually computed on a discrete grid Agi C 6 and the estimate of 9* corresponds to the largest 
approximated likelihood on A^i. See also Olsson & Ryden (2008) for an alternative procedure for 
state-space models. 

Techniques such as ABC and composite likelihood (Wiuf, 2006) do not give solutions which 
are exact w.r.t. the original model whilst, when possible, exact inference is of interest. This is 
because, given a reasonable stochastic model, the approach allows investigators to exactly (up-to 
a numerical error) average over the uncertainty in the tree structure when estimating genetic 
parameters of interest. One of the main drawbacks of existing exact IS/SMC schemes is the 
simulation of the tree backward in time, from observed data, until the tree coalesces. In many 
scenarios, especially for large data sets, when getting close to the top of the tree, it often takes 
a long time to coalesce. This is due to genetic parameters (e.g. mutation rates) that can be 
very large relative to the size of the data. Consequently, it can take a very long time to simulate 
the tree back to the MRCA. As a result, the variance of the estimate of the likelihood can be 
higher than is desirable, along with long CPU times. It should be noted that the calculation of 
the likelihood at these points, 9 £ can be inferentially important. In addition, it is seldom 
possible to speed up the simulation via importance sampling as the variance of the weights can 
become too large. That is, by adapting the parameter of the proposal to lead to a fast coalescence, 
the discrepancy between the true process and the proposal leads to a very inefficient algorithm 
w.r.t. variance. 

1.1 The Time Machine 

The approach proposed in this paper is based on IS. Stephens & Donelly (2000) proposed a way 
to use IS efhciently to simulate ancestral trees by characterizing an optimal proposal distribution 
and similar methods have since been developed for a variety of genetic scenarios (e.g. De lorio 
& Grifhths (2004a,b)). The basic idea is to define an efficient proposal distribution on ancestral 
histories which allows us to reconstruct Markov histories backwards in time from the sample y 
to an MRCA. 

We introduce a stopping time in the IS proposal, backward in time, to stop the simulation 
before the MRCA is reached. Then using a simple stopped identity, forward in time we are 
able to characterize the bias introduced in the evaluation of the likelihood due to stopping the 
simulation of the stochastic tree. The bias can be understood by considering two aspects: 

1. The underlying mixing of the evolutionary process 

2. The last exit time distributions on the process. 

In the context of ([T]), the idea is that for many models, close to the top of the tree, the process 
is able to forget its initial condition. As a result, stopping the simulation is reasonable, because 
the place where it is stopped is forgotten by the process forward in time; we formalize these ideas 
later on. In reference to ([2]), the more information there is on the true marginal distributions 
of the process, the more it is possible to reduce the bias. Ideas from the theory of population 
genetics models (Ethier & Griffiths 1987; Ewens, 1972) will be used to achieve the latter. 

In reference to a comment of Edwards (2000), our method is termed the 'time machine'. This 
is because, estimation is performed saving the simulation time of going all the way back in time 
to the MRCA. A similar idea, in the context of filtering, can be found in the work of Olsson et 
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al. (2008) and also in option pricing Avramidis & LEcuyer (2006). In our context, we have a 
simpler underlying process than in filtering, but the ergodicity conditions considered there do not 
apply here. The mixing conditions that they require only apply locally and thus the proofs have 
to be modified. Recall that approximate tools for inference from stochastic trees (e.g. Del Moral 
et al. (2009), Meligkotsidou & Fearnhead (2007), Tavare et al. (2000)) are available. However, our 
approach is 'less approximate', in that our point-wise estimate of the likelihood is significantly 
less-biased, but costing more in computational-time. 

This paper is structured as follows. In Section |2] we introduce a motivating example, the 
coalescent model, which will help to illustrate our ideas. In Section [3] our methodology is de- 
scribed; Section|4]features an analysis of the bias of the approach; Section|5]presents a simulation 
study to demonstrate the performance of our algorithm and we conclude the paper in Section 
|6j Appendix 1 contains some proofs. Appendix 2 details of our numerical implementations. Our 
ideas are illustrated in the context of the coalescent. However, the formulation is kept as general 
as possible, as the framework can be extended to other tree models, such as the infinite sites 
model. In Appendix 3 we show how this can be done. 

2 Motivating Example 

The coalescent model is used as a motivating example for our work. Some notations are first 
introduced. In particular, we consider the case in which the type space E — {1, . . . , c?}" for the 
collection of the n e 1,^ genes/chromosomes is finite and the only genetic process of interest is 
mutation. 

2.1 Notation 

Denote by {E, S) a measurable space. For two cr— finite measures Ai and A2 mutual absolute 
continuity is written Ai ~ A2 and the Radon-Nikodym derivative as dXi/d\2- Given a Markov 
kernel P : i? x — )■ [0, 1], let P^{x, ) = 5x{-), (the Dirac measure) and write the composition 
for j > 1 as {x, ■) = J P{x, dy)P^~^{y, •), with a corresponding composition of inhomogeneous 
kernels as Pi.j. Write 1a as the indicator of a set. For Card(i?) < 00 

S{E) = {P= {p^J)^,JeE : P^J > 0, ^p,;, - 1 H 3^^, > 0, Vz G ^, ^ = 1, ^P = V'} 

leE leE 

denotes the class of stochastic matrices for which there exist a stationary distribution ip. The 
collection of bounded and measurable function are denoted Bi,{E). The supremum norm is 
written ||/||oo ~ sup^.^^ 1/(2^)1- The total variation distance between two probability measures 
Al and A2 on (i?, S") is ||Ai — A2||ti; :— sup^g^ |Ai(A) — A2(v4)|. Given a probability measure A, and 
a j G Z+, the product measure is written A**^ := A'^-'"^ x A, A**"^ := 1. The vector notation x = 
{xi, . . . , Xj) = xi;j is adopted. In addition, let the d-dimensional vector Ci = (0, . . . , 0, 1, 0, . . . , 0) 
where the 1 is in the i*^ position. The Li norm of a vector is written |a;i:j|i := jxij + • • • + \xj\. 
For (ieZ+,Td = {!,..., d}. 

2.2 Identity of Interest 

Define the tree model on the measurable space (F,^), with ^ = a{F). Let n > 2. The basic 
idea is to maximize, w.r.t 9 G Q, the quantity 

liyi:n;0) ^ ^2 / 7re(zi:fc)I{yi.„}xB„+i(zfc-l,i(zfc))/''(yi:n,Zfc-i;6')dzi:fc. (2) 

where the observed data is yi-n € E, t : En —?' Z''*, normally the identity, for m € Z+ 

B,n^{xeZ''' ■.\xi.,d\i=m}. 
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and F = UfeeK,, ({^} ^ ^n) fo^' some E C F„ and /C„ C Z+ depending upon the model under 
study. In all of our examples, 7rg(zi.fe) corresponds to the density of a non-decreasing (in some 
sense) Markov process in discrete time, stopped at a random time k G /C„; that is 



Throughout the article it is assumed that X^fegK If'' ""sl-^iifc) = Ij i-C- that the stopping time 
is a.s. finite w.r.t ng. The stopping time will be determined by the first time that the tree is of 
'size' n + 1. 

Introduce an absolutely continuous distribution Qg on F and sample )i<i<Af according 

to Qg, then the IS estimator of l{y; 9) is 



N r 



^g(^l:U)^{y}xB„+i(<W-i:fc(.))^'(2/l:»,<w_i 



(i) 



where 



and 



ln,e{zi.,k, k) 



Qe{zi:k) 



the empirical measure of the simulated samples. 



2.3 The Coalescent Model 

Denote the number of genes of type i at event j of the process as zj, with Zj 
The objective is to find the genetic parameters 9 = (m, f") where fi G M+ and P E S{Td), 
@ = K"*" X S(Tii). ^ is the mutation rate per chromosome per generation and mutations along 
the edges of the tree occur according to a Markov chain with transition matrix P. 
The various components of the identity ([2| for the coalescent model are defined as: 

F = \J ({k}xF.^ 
keKn ^ 

F„ ^ {z^'-'^ e{1+ yj{Q}Y ■.2<\z^%<n + l} 



JCn = {n,n + l, 
with t the identity function, 



nyZ„z;9) = 



■} 

ji! 





otherwise 



and finally. 



T^eizi-.k) =^z:\z\i=n+l}izk)}\ Y[pgiZj-i, Zj) [ / pg{zo)pg{zo, Zi)dzo 

^3=2 ^ 



where 



pg{zj^l,Zj) 



|zj-l|l |Zj_i|i-l+/i 

|zj-l|l |zj_i|i-l+/^ 





Pa 



if 
if 



Zj-i + ei 



otherwise. 
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pI(zo,zi) ^l{z:z=2zo}{zi) and 

Pe{zi)) = 



V'fl(i) if ZQ^Ci 
otherwise. 



Write Pg(-Zi) = Jp9(zo)pg(2;oj zi)dzo (here dz is counting measure). Note that for any fixed n > 2, 
6* e 6, J2keic If'' ^e(-zi:fe)rf^i:fe = 1- For simpHcity of exposition, the results are given with only 
mutation. However, they can be easily extended to the case of migration as well (e.g. De lorio 
& Griffiths (2004b)). 

2.4 Likelihood Computation 

To compute the likelihood, for a given 9 G Q, importance sampling is adopted. An importance 
distribution, Qg, is introduced to simulate the tree backward in time to the MRCA; this ensures 
that the data is hit. 

In details, let x denote the reverse chain backward in time and write x G instead of z (this 
convention is used throughout the article, see also Figure [l]). Let: 

Ql^'-^ixi-.k-l) ^l{y'^^jixi)l YlqeiXj-l^Xj) n{cce{Z+U{0})-':x=e,,zeTa}iXk-l) 

^ j=2 ^ 

for some Markov transition qg; see Stephens & Donelly (2000) for the optimal Qg. Then the 
likelihood is 

ZC," - rij^i /■ JyT Pe{xj,xj-i) \ 



Qg^"'{xi;k-l)dXl.,k- 



1- 



The simulation proceeds by sampling from qg{yi.^, ■) and computing the weight 



wg{y^.^a,X2) 



99(^1:^,2:2) 



Simulations backward in time are carried out until we reach the MRCA, i.e. when there is only 
one individual in the sample. This procedure is repeated N times to provide a Monte Carlo 
estimator for the likelihood 

^ ^^"'^^ ^ n - 1 + » (N^ Pe(4/.,_i) 11 wg{x)l„x)>) 

^ ' ■' i=l j=2 

where {x'"^^^^i^_^)i<i<N are the simulated samples, x^i'' — y"^ for every i G Tat and 
T f I., f ""-^ n-=i(y")'l , J'tT f ^ 

ln,e[Xl:k-l,k) = < — , }-pg(Xk-lj< Wg(Xj-i,Xj) 

[n~l + ^i nl J [^J-^ 

This can be repeated for many 9 using a driving value (Griffiths & Tavare, 1994) or bridge 
sampling ideas (e.g. Fearnhead & Donelly (2001)). In addition, to deal with the problem of 
weight degeneracy (e.g. Doucet et al. (2001)) resampfing steps can be added. See, for example, 
Chen et al. (2005). 
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Backward 



Vl.d 



Forward 



^4 



Figure 1: A Coalescent tree. Here there are no mutations on the tree, and -63 is the set used to 



stop the simulation, backward in time (see Section 3.2). The horizontal dotted line represents 
the random times p and a — 1. 



3 Stopping the Simulation 

It is now detailed how we stop the simulation of the stochastic tree back in time before the MRC A 
is reached. In the next Section we provide theoretical results and connections to the theory of 
SMC are established. For the purpose of stopping the simulation, introduce two stopping times 
(forwards in time): the first hitting time of the set -B„+i 

T := inf{fc > 1 : \t{Zk)\i = n + 1} 

and some stopping time a associated to the hitting of a set A € ^ 

a inf{fc > 1 : Zfc G A} 

such that 

P„,(a < r) = 1 

where P^re is the Trg— probability. For example, in the context of the coalescent, it is suggested to 
take, for m < n 

a = inf{fc > 1 : \Zk\i = m + 1}. 

3.1 A Stopped Identity 

Let E^jj denote the expectations w.r.t the process {Zk}- Then the likelihood ^ can be written 



I{y,.,„}^B,,+AZr-lAZr)Kiu 



0) 



and applying the strong Markov property we have 



^{hyi:^}-xB^+i{ZT-i,t{Zr)}l''iyi.diZT-i; 
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that is, 



+ l:r 



(3) 



where 



The equation ([3| will be the starting point for constructing our biased estimates of the likelihood 
function. 



3.2 Coalescent Model 

Consider the coalescent model. Specifically, define, for n > m > 3 the stopping time a 

a ^ inf{/c > 1 : Zfc e B„i+i} 

which is the first time the forward process has to + 1 individuals. 
Using equation (|3|, we have 



a—m T—a-f-n—m 



dz^. (4) 



In words this means that to have to + 1 chromosomes, we need a minimum of to steps in the 
process and r has to be at least n — to + a steps. 
In this case, write 

'^eiZa) = j '^e{zi)\ W_Pe{Zi-l, Zi) HB„xS,„ + i(Za-l:Q)rfzi:a-l 

Note this is well-defined due to the fact that the size of the population is non-decreasing, and 
then, for any a £ /Cm 



T^e{Za) = / lB„, + t{Za)'^e{Za-l)Pe{Za-l,Za)dZa-l 



where 



T^e^Za-l) 



m — 1 + fi 



Y- J pI{zi)1 Y\_Ps(^i-i'^i)\pei^a-l,u)^BrnXB^+AZa-l,u)dzi,a-2du. 



That is, given a, the distribution of the chromosome counts at the first entrance time of J5m+i 
can be written as the composition of; 



• the distribution of the counts at the last exit time from B„ 

• and the Markov transition. 



Returning to the likelihood Q and making the substitutions, a' = a — 1, r] = t — a' + 1, it 
thus follows that 



— 1 



Y '^f)iZa')lB^+^{Za' + l)< Yl Pe{Zi-l,Zi)\ 

r/— n— 7n+2 a'— 7n — 1 ^ z— a' + l ^ 
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Here 77 is the time from the last time there are m chromosomes to n + 1 chromosomes. Now set 

p = M{k>l:\Xk\eB,n}. 

In other words the simulation is stopped the first time there are m chromosomes. Our approxi- 
mation of the likelihood is then 

lb{yi:n]0)= ^ / he{Xp)\ '^pg{Xi,X^^i)\l{y^^^}y^B„ + A^2A)l''{yi,d^X2-,0)dxi.,p. 
p=n-m+2 •' ^ i=2 ^ 

On the basis of the above analysis, it is then clear that if 

00 „ 

he{x)^ ^ j 1Tg{Za)\z^}{x)dZa (5) 

then the approximation of the likelihood is exact. That is, to minimize the bias an approximation 
of the true distribution of the counts at the last time there are m chromosomes should be used. 
The ideas and notation are clarified in Figure [T] 

4 Results on the Bias 

In our biased simulation, using the decomposition ([S]), the procedure will approximate 

h{Vl:n]0) j he{Xp)< \\_P0{x^,X^-l)\\yr^^}^B^+Ax2■A)l''{yl:d^X2]O)dXl,p 

where our notation is such that: 

• (yXk)^y^ is the time reversed process 

• p is a first hitting time associated to (^Xk)^^^ 

• hg an approximation of a marginal probability. 

4.1 Error Bounds 

We begin by giving a simple result on the error bounds for SMC algorithms. The result applies 
to the standard IS algorithms, for example in De lorio & Grifhths (2004b), Stephens & Donelly 
(2000), and for the SMC algorithms as in Chen et al. (2005). The simulation is to be performed 



backward in time, as in Section 2.4 The ideas here are adapted from the theory of Del Moral 
(2004). 

The biased estimates are denoted as S^{ln,e), N > 1, where ln,e depends upon whether IS 
or SMC is implemented. For example, in the IS case: 



he{xp)i Yli=2Pei^i^^i-i)\hyZ,Jx-B„+i{x2:i)l''{yi.,d^x2;9) 
ln,B{xi..p, p) = — ^ 

where 

Qeixv.p) =^y^, jxB„+A^2:i)< ge(a:j,a;.,+i) n(A-)p-^xA{xi:p) 

i=2 ' 

is such that A is the set associated to p > 3 (Qe— a.s.), xi,X2 ^ A and J Qe{xi:p)dxi.p — 1. 
Below expectations w.r.t the stochastic process that is simulated by the algorithm are written as 
E and it is assumed 

||^n,e(a;i:p,p)||oo < +00 ¥6* e 9. 
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Proposition 1. For any n > 2, p > 1, 6 £ Q, yi-n, there exists a Bp n(0) < +oo such that: 
E[\S^{L,e) - liyi:n;0)\P]'^P < ^^^^ + \hiyi..n;0) - /(yi:„;0)|. 



Remark. The result shows the standard variance-bias type decomposition. That is, Bpji{9) /V N 
can be thought of as a bound on the variance and Ihivi-.n] ^ livi-.n', 0)\ is the bias. Our estimate 
converges to Ibiyi-.n] ^) , o.'f^d it is sought to control the bias term, which, in our case can be 
approximately written in the form 

\lb{yi:n;0)-l{yi..n;e)\ = | [Ai - A2] (Pl:fe (/) ) | (6) 

for Ai, A2 two probability measures and {P„} a sequence of non-homogenous Markov kernels (9 
is suppressed on the R.H.S). 

4.2 Controlling the Bias 

A simple technical result is now given which shows how to control the bias term ([6]). 

Some assumptions are now made, that can be satisfied by many stochastic tree models. In- 
troduce a sequence of time inhomogeneous Markov kernels {Pn], on space {R,^) and a sequence 
of sets ({C„:n>0,C„e^})„>p. 

(AI) Stability o/{P„}. 

(i) Initial Probability Measures. Ai,A2 are concentrated on Cq. 

(ii) Absorption of {Pn}- For every n > 1, x E Cn~i we have 

Pn{x,Cn) = l. (7) 

(iii) Local Mixing of {Pn}- For every n > 1, there exist e„ S (0, 1), i^n concentrated 
on C„, such that for all x G C„_i 

en'^n{-)<Pn{x,-)<-'^n{-)- (8) 
En 

The assumption (-A[l]) (which is comprised of (i)-(iii)) will refer to the fast mixing of the 
process close to the top of the tree. The absorption type assumption refers to the birth process 
associated to coalescent type chains. 

Proposition 2. Assume (j^. Then, for any k > 1, define: 

2 1 - 

and we have 

|l[Al - A2]Pl:fe||t„ < ^k\\Xl - X2\\tv- 

Remark 1. The result helps to bound the bias as 

|[Al - A2](Pl:fc(/))| < ||/||oo||[Al - X2]Pl:k\\tv- 

Essentially, the fast mixing of Pn within the domain it is constrained to allow the composition 
of kernels to forget its initial distribution at an exponential rate. In addition, as in Olsson et 
al. (2008), assuming en is uniform in n, the benefits of stopping, in terms of variance/bias trade 
off can be substantial. 

Remark 2. One point of interest in the sequel is that, if the mixing condition (v^QJ) does not 
hold, it is possible to establish a similar bound when the initial measures Ai and A2 are similar. 
That is to say, when Ai A2 and 3e G (0, 1) such that 

dXi 1 
e < — - < -. 
d\2 e 

This is unsurprising as it implies that if the kernels do not mix, we need to 'match' Ai and A2 
for the bias to be small. 
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4.3 Verifying the Assumptions 

(A[T]) is now discussed in the context of the coalescent. Note that the resuhs foUow, with some 
extra work, for coalescent processes with migration. Readers interested in how the method may 
be apphed can skip to Section [5j with no loss in continuity. 

Suppose that the transition matrix P satisfies, for any i,j G e^p G (0,1) and 

probability ip, ipj > 

^^Vj <Pij < i'^Vr 

This condition implies that P mixes extremely quickly. Let Co = {zi-d : — 3}; this 

corresponds to the space of Ai = 7r|. Also let Ci = {zi-^ : 3 < jziidli < 6}. It is clear that 
Pg{x,Ci) = 1: since we start with at most 3 chromosomes and the most possible after 3 steps is 
6. Now it can be seen that, for any z G Cq 



and 
with 



ei =6 



pl{z,-)<e~,\®\-) 

/ \ 2n 3 

36^, I" inf (pfj)] I ^ 



Here the minorising probability v = ^p®^ puts all its probability on having 3 chromosomes. Then it 
can be subsequently seen that satisfies condition ([s]), with C2 — {zi-d ■ 3 < |zi:d|i < 12, rii > 0} 
and so fourth. In effect the condition ^ holds with e„ 0; that is, the closer to the top of the 
tree we stop, the faster the process will mix forward in time. 

As a result, to bound the bias we can write it, approximately, in the form, for r > n — m + I 



Af(r)||Ai-A2||t„ 



k—n 



with Ai as in ([s]), A2 = hg, M{r),R(r) G (0, 00), I € {1, . . . ,k} is associated to the fact that we 
need to iterate the kernels to satisfy (|8| and ^ G (0, 1). r is an integer big enough (say 100000) 
where we suspect that the possibility of generating a tree of length n and hitting the data is 
extremely small, so we can neglect the upper term. Thus, approximately, the bound shows that 
the bias falls geometrically as we stop closer to the top of the tree. Note, however, it cannot go 
to zero unless Ai and A2 are equal. To an extent, finding good approximations is more difficult 
than being able to stop the tree, which is why we focus on this. 

Remark 1. The result given here mirrors one proved by Donelly & Kurtz (1999) for Fleming- 
Viot models. In Theorem 9.4 of that paper they show that the particle process is uniformly ergodic, 
if the mutation process is. This is very similar to the property established above. 
Remark 2. The information, in terms of when to .stop the simulation, that is contained in the 
bound on the bias is as follows. If the mutation process mixes quickly, as above, then the bias 
falls at a geometric rate: we should stop the simulation when the process starts to mutate many 
times. This could be measured in terms of the effective sample size (e.g. Liu (2001)), if trees are 
simulated in parallel, or alternatively, if jj > n, for M G Z+ a large multiple of the current size 
of the tree. 

Remark 3. In terms of the expression ||Ai — A2||tt), one could adopt a parent-independent 
mutation (PIM) marginal. If we have 

supllpeix, ■) - pe,m{x, ■)\\tv < sup \pij = g 

X i,j 

where pg_rn{x,-) is the transition for the PIM, and the mutation vector is <j).j, then ideas from 
perturbed Markov chains (e.g. Mitrophanov (2005)) can be adopted to determine a quantitative 
bound. We are currently investigating a meaningful bound. 
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5 Simulations 

5.1 Experiment Set-Up 

To illustrate our approach, we consider three simulation scenarios: two PIM models and one 
parent dependent mutation model (PDM). The two PIM models, denoted PIM 0.5-0.5 and PIM 
0.1-0.9 are based on the following per-locus transition matrices: 

fO.b 0.5\ , fO.l 0.9\ , 
(,0.5 0.5;'^"^^ [o.l 0.9) respectively, 

while the per-locus mutation probability matrix underlying the PDM model is 

/0.5 0.5\ 

^0.1 0.9)' 

In all three scenarios, the initial population was set to 100 sequences and we considered a single- 
locus case {i.e. with 2 possible types). For the PDM model only, we also considered the case 
of 10 loci {i.e. 2^^=1024 different types). Irrespective of the number of loci considered, the 
distribution of the 100 initial sequences among the different types was sampled from a multinomial 
distribution with a probability vector P defined as the invariant point, solution of equation, 
-P(2"xi) = ^(2"xi)2^(2"x2")i where n denotes the number of loci considered and T is the full 
mutation probability matrix. 

The algorithm description is given in Appendix 2. For the function hg, we use the distribution 
of an un-ordered sample from a PIM model (which, even for the PIM cases, is not the correct 
distribution in the bias term). 

5.2 Simulation Results 

Simulations were carried out until there were TM = 1, 2, 5, 10, 25, 50 sequences left in the popula- 
tion. TM — 1 exactly corresponds the approach in Stephens & Donelly (2000) and is subsequently 
referred to as SD. For each simulation, we examined 60 values for fj, ranging from 0.1 to 30.1. 
We report in Figure [2] the estimated log-likeHhood distribution, based on 100,000 samples for all 
four simulations scenarios (presented in lines) and three values of fj, (presented in columns). 

In Figure [2] it is clear that as expected, uniformly across the values of /x, the closer to the 
MRCA the algorithm is stopped, the more accurate the distribution of the likelihood is estimated. 
However, up to TM 10% (and even TM 25% for /j, > 20) our results suggest that the time machine 
approximation and correction provides an accurate estimate of the distribution of the likelihood. 
Conversely, when the algorithm is stopped too early (TM > 25%) the biased estimator underlying 
the time machine approach leads to very inaccurate estimates of the likelihood. For even more 
extreme cases (TM 50% for ^ = 0.1), this results in a highly shifted estimated distribution of 
the likelihood. 

The above observations are also reflected in the mean likelihood (Figure |3| . For every model 
considered here, the simulations of the time machine up to TM 25% seem to provide estimates 
of the mean likelihood that are similar to the SD approach, although for larger values of /i, TM 
25% seems to overestimate the mean likelihood. Furthermore, the time machine approach seems 
to accurately locate the value of /i maximizing the likelihood for TM< 10%, and to provide 
acceptable approximations for for this when TM> 10%, regardless of the simulation scenario. 

In Figure [4j the average computation time per iteration is plotted as a function of /z for 
the PDM-10 loci simulations. Results for all other models led to the same conclusions and are 
therefore not shown. From this figure, the computation time appears to be a linearly increasing 
function of fj,: increasing the mutation rate naturally decreases the probability of simulating a 
coalescent event and therefore tends to increase the time to reach the MRCA (or any population 
size). However, it seems that stopping the simulation when there are only more than 5 sequences 
left in the population drastically reduces the computation time: for TM 5% the simulation run 
is on average more than twice as fast as the SD simulation, and for TM 25%, the time machine 
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Figure 2: Estimated distribution of the likelihood for the four simulation scenarios and for three 
values the mutation rate fi = 0.1, 10.1 and 10.1. In each model, results are presented for the six 
stopping times in the simulation of the genealogical tree. Plots are based on 100,000 samples. 



is more than 3 times more time-efRcient than the SD algorithm. It should also be noted that 
'large' values of fj, (around 10), for which the time savings are most significant, also seem to be 
infcrentially important (see the fourth panel of Figure [s]) . 

In Figure [5] the relative standard deviation across our 100 repeats of the algorithm, of the 
time machine to SD are plotted for all the scenarios considered. It can be seen, as expected, that 
there is some variance reduction and, for example for the TM 5% PDM, the variance reduction 
is of the order 1.5. 

On the basis of our experiments, combining both computational efficiency and the numerical 
accuracy, the use of the time machine with TM 5% is an efhcient alternative to the SD algorithm. 
The CH — h code is available upon request from the third author. 



6 Summary 

In this paper we have considered a new approach for simulation of stochastic trees and likelihood 
calculation of sample probabilities in population genetics models. The approach consists in 
stopping the backward simulations before the top of the tree is reached. We have provided 
theoretical results on the bias introduced in the estimation of the likelihood. Some extensions to 
our work are described below. 

Firstly, to extend our analysis to different models. The paper has been written to facilitate 
such analysis and we believe it is rather simple to deal with other stochastic tree models. Also, 
some further empirical investigations would help support the simulations and theoretical analyses 
presented here. Our methodology would be further enhanced with GPU technology (e.g. Lee et 
al. (2010)), and this is one area that we are currently investigating. 

Secondly, to look at the consistency (in a likelihood sense) of our biased Monte Carlo esti- 
mator. As we observed in Section |5] it appears that the Time Machine seems to recover the 




Figure 3: Estimated likelihood for the four simulation scenarios as a function of the mutation 
rate fi. Plots are based on 100,000 samples. 
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Figure 4: Average computation time as a function of the mutation rate fi. Figures are based on 
100,000 samples. 



maximum likelihood estimator. Therefore consistency, or potential asymptotic bias is of genuine 
interest. There are very few results in the context of consistency, due to the dependency in the 
data, after integrating out the tree. That is, it is difhcult to apply uniform laws of large numbers 
to complex dependency structures. None-the-less, we suggest the work of Douc et el. (2004), 
Fearnhead (2003), Olsson et al. (2008), Olsson & Ryden (2008) as possible starting points for a 
proof. 

Thirdly, the time machine can be used in the context of Markov chain Monte Carlo (MCMC). 
If one is interested in Bayesian parameter inference, then a stopping-time SMC algorithm can 
be used within an MCMC algorithm (particle MCMC (Andrieu et al. 2010)). Significant time 
savings per iteration can be gained by using the time machine; see Jasra & Kantas (2010) for 
some details. 
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Appendix 1: Proofs 

We give the proofs of Propositions [l] and |2] 

Proof of Proposition^ In the case of IS, the result follows by adding and subtracting lb{yi:n] 6) 
applying Minkoswki and the Marincinkiewicz-Zygmund inequality. In the case of the SMC algo- 
rithm the proof follows from the fact that the algorithm approximates a multi-level Feynman-Kac 
formula; see Chapter 12, Proposition 12.2.3 Del Moral (2004). Note that this point is appar- 
ently over-looked in Chen et al. (2005), and such a result helps to verify the convergence of the 
algorithm. In addition, note that the Proposition 12.2.3 of Del Moral (2004) does not depend 
on the importance weights being upper-bounded by 1. Hence, due to the boundedness of the 
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Figure 5: Relative standard deviation across 100 repeats of the time machine to SD. Figures are 
based on 100,000 samples. 
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weights, the same proof as for IS apphes, except the Lp bound for particle approximations of 
Feynman-Kac formulae is used instead of the Marincinkiewicz-Zygmund inequality. □ 

Proof of Proposition^ The proof is fairly simple and combines the proof of Lemma 3.9 and 
Theorem 4.1 of Le Gland & Oudjane (2004) (see also Theorem 3.1 of Tadic & Doucet (2005)). 
The idea is to use the contraction property of the total variation distance and Hilbert metric, as 
well as the relation between the two (see Lemma 6.1 of Tadic & Doucet (2005)). 

The only real complication is using the local mixing condition ([s]) to derive a bound on the 
Radon-Nikodym derivatives 

d{XlPl:k) d{\2Pl:k) 
d(A2Pl:fc)'d(AlPl:fe)- 

Consider AiPi:fc, clearly 

AiPi:fc_i(Ic,_,Pfe(/)) < -Mf)- 

In addition 

AiPi:fc_i(Icc_^Pfc(/)) = 0. 

Since 

AiPi:fe(/) > ekVkif) 

it follows that 

d{\lPl:k) ^ 1 

d{\2Pi:k) - el ■ 

The proof can then be concluded by following the arguments of Le Gland & Oudjane (2004), 
Lemma 3.9 and Theorem 4.1. □ 

Appendix 2: Algorithm Description 

Let xt = (a;i,t, . . . , Xd.t), t £ the population size within each of the d states at time t. The 
algorithm will simulate backward in time genealogical trees for an initial population, xi the 
(d— dimensional) counts associated to the observed data, until there are Ntm sequences left in 
the population. The case where Ntm—^ corresponds to ordinary coalescent and Ntm > 1 to 
the time machine. Most of the notations can be found in Sections 12.11 and 12.21 

Iterative algorithm 

For any generation t, there are \xt\i sequences left in the population, the following steps will be 
iterated until |a;(|i=iVTM: 

1. Sampling the type of the offspring sequence (i) with probability 

Xi,t 

2. Getting the type of the ancestor sequence (j). 

A sequence of a given type i can have arisen from an ancestor sequence of type j through: 

(a) a coalescent event with a probability proportional to 

\xi,t\ - 1- 



(b) a, j to i mutation event (inclusive of self mutations, from type i to type i), with 
probability proportional to: 

^IKijPj^, (9) 
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where 



l^tll - 1 + ^ 



\xt\i -1 + ^l 

3. Updating the population sizes within each type. 



if j 7^ i 
if i=j 



Xt+l = 



if a mutation has occurred 

if a coalescent event was simulated 



4. Calculate the contribution to the likelihood of the simulated event (suppressing the sub- 
script 9) 



Wt 



where 
and 



K2 Kij \xt\l 

Kl 1 Xi^t+i{\xt+i\i - 1) 

K2 Kii Xi^t{Xi,t - 1) 

Kl ^ |a:t|i(|a::t|i - 1 + 
K2 = |a;t+i|i(|xt+i|i ~1 + n). 



if a mutation has occurred 



if a coalescent event was simulated 



(10) 



5. Updating the log likelihood 



Wt 



\og{wt) ifi=0 
Wt-i + logiwt) it t > 1 



(11) 



6. Assessing the stopping criterion. 

When the time machine is used (i.e. Ntm > 1), steps 1 to 5 are repeated until > 
Ntm- Otherwise, when the full tree is simulated (7Vtm=1), steps 1 to 5 are repeated until 
there are 2 sequences left in the population. Then, mutations are simulated until both 
remaining sequences are of the same type, based on the following three steps: 

(a) Choose one of the two sequence, of type i, with probability 0.5. 

(b) Simulate the mutation event from an ancestor of type j (to type i) according to the 
probability defined in equation ([9]), and setting the coalescent probability to 0. 

(c) Calculate the corresponding weight for the sampled j to i simulated transition. At 
this final stage there are only two individuals in the population {\xt\i — \xt+i\i = 2), 
hence Ki — if 2, and = 1, then: 



When the final generation is reached Xj^t-y-x = 2, and Xj^t+\ = 1 ioi any other iterations 
in that step. The weight for each generation, derived from ( 10 ) is then defined as: 



Wt 



Xj^t + ^J-ipj 



at the last generation 
otherwise 



(d) Update the likelihood according to equation (11) 



18 



Estimating the bias 

This step is specific to the time machine (i.e. if Ntm > !)• Recahing that p is the generation at 
which the iterative algorithm was stopped, the bias induced by stopping the simulation before 
reaching the MRCA is estimated as: 

where F denotes the gamma function. The likelihood of the tree is then updated 

Wp+r = Wp + \og{b). 

Estimation of the hkehhood 

The above algorithm is independently repeated N times, the estimate of the log-likelihood is 
where W^'^^ is the value of the final weight for sample i and W = maxi<j<jv W^''\ 

Appendix 3: Infinite Sites Model 

We now consider our results in the context of the infinite sites model. We concentrate upon 
likelihoods associated to rooted genealogical trees; see Ethier & Griffiths (1987) or Griffiths & 
Tavare (1995) for more details. 

The Model 

The model is based upon the simulation of distinct DNA sequences, and the multiplicity of the 
sequences. In more details, the simulation begins with a single DNA sequence xi = (0), and 
counts ni = 2. The process can then undergo a mutation (rate /x) or a split. If a mutation occurs 
(to the ffist sequence say) we have the new state xi = (1,0), X2 = (0) and n = (1, 1), otherwise 
the new state is xi = (0) and ni = 3. 

The key point is that new mutations introduce a new site (that is a new integer number 
(which is larger than all others currently present) to the start of a selected sequence) and hence 
DNA sequence, whilst splits only increase the number of an existing sequence. The state-space 
consists of the d— distinct sequences (vectors of potentially different length sequences xi, . . . , Xd) 
and the respective counts (ni , . . . , rid) of the sequences that have been simulated. That is, in the 
previous notation 

Zi = iixi:d)i, («l:d)i)- 

The simulation stops, as before, when |(ni:d)fe|i = n+1. In general, transitions are governed by 
the following Markov kernel. A mutation (rate /u), at time j — 1, of the Z*'' sequence occurs with 
probability 

1 M 

\{ni:d)j-l\l |('^l:d)j-l|l - 1 + M 

and a split of the sequence occurs with probability 

|(rtl:rf)j-l|l - 1 |(ni:rf)j-l|l - 1 

|2;j_i|i \{ni;d)j-i\i-l + H 
see Ethier & Griffiths (1987) and Griffiths (1989) for details on the transition dynamics. 
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In this scenario, the state-space is more complicated. Let 

Ed,r,:r, = {(ri, . . . , r,) G (Z+ U {0})^ (xL^, . . . , 4,J e (Z)'-i X • • • X {{Zy-) : 

here are the lengths of the distinct sequences, and the ordering constraint notes that the 
discovery of a new site is added to the beginning of the segment vector. In addition, let 

Fn.d = {n'--'' e{Z+fn2<\n^%<n + l} 

En ^ U {dn} X Ed^^n-.r^^ X En,d„ 

then 

F - [J {k}x El 

keK„ 

There are three trans-dimensional aspects to the state-space; the time to simulate n+1 sequences; 
the number of distinct sequences and the respective lengths of the distinct sequences (which is 
determined in part by the first two aspects). 



The Bias 



For the infinitely-many-sites model, we will use the idea of the first time the number of segregating 
sites is m (or mutations here) to stop the simulations backward in time. In a similar manner 
to Section 3.2 it can be established that we want the approximating function hg{-) to be the 
marginal of the process at the last time we have m segregating sites. 

In the context of the infinitely-many-sites model, the bias is controlled by our ability to 



approximate this marginal (see Remark 2 in Section 4.2 1. This is because the Markov transitions 



can only change the multiplicity of counts, or increase the number of distinct sequences; we are 
unable to change the beginning of sequences. As a result, it is not possible to establish conditions 
such as (A[T]). 



Approximating the Marginal 

We propose the following approximation of the marginal, based upon the theoretical properties 
of such models (Ethier & Griffiths, 1987;GrifRths, 1989) and the relation to the infinitely-many- 
alleles model (e.g. Griffiths (1979) and the references there-in). Let us consider the marginal 
distribution, call it ^e- We extend the state-space to include uncertainty on d, the number of 
distinct types, c = and s the number of segregating sites, and adopt the decomposition 

(,e{zi:d, d, s, c) = £,e{xi:d\d, s)£,e{ni;d\d, c)£,g{d\c)^e{s\c)£^e{c). 

Now, under certain conditions, there are results about the exact densities (,e{ni:d\d, c) (Ewens, 
1972) and £,0{d\c) (Watterson, 1975). In the case ^e(s|n), as noted by Griffiths (1979), for large 
populations (such that diffusion results can apply) the infinitely-many-sites and infinitely-many- 
allele frequencies are not too different. Therefore, we propose to use the probability (as in Ewens 
(1972)) 

C.Hc) = ^^^|5W| 

r(^ + c) 

with s'^'^ are Stirling numbers of the first kind. 

For the quantities £,e{xi-d\d, s) and ^^(c); we use approximations. For the former, a uniform 
distribution is adopted 



ie{xi:d\d, s) 



E 

7n\,...,ms£Cd,s 



n'" 
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where 



Cd,s ■= {mus : rrii G {0,. . . ,d},mi-\ hrngS 



(^d,s} 



That is, it is a simple task in combinatorics to show that if there are s mutations with mi-s 
repetitions of mutations 1 to s, (subject to the constraint that each mutation can only occur at 
most once in each sequence and that order of allocating a mutation does not matter) then there 
are 



possible sequences; summing over all the possible multiples yields the desired cardinality of the 
state-space, ^e(c) is not known (except as the marginal of a recursion (as in Ethier & Griffiths 
(1987))) and is assigned 'Pois{nj /9) (Poisson) distribution (at time )). 

In practice, it may not be possible to evaluate some of these quantities and a further Monte 
Carlo simulation/numerical approximation (for the integral over s and the normalizing constant 
of ^0{xi;ci\d, I)) will be required. That is to say, we set 



The approximation will be different for every simulated sample. 
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